The rapidly growing size of deep neural network (DNN) models and datasets has given rise to a variety of distribution strategies such as data parallelism, tensor-model parallelism, pipeline parallelism, and their hybrid combinations. Each of these strategies comes with its own trade-offs and performs best on different models and hardware topologies. Selecting the best set of strategies for a given setup is challenging because the search space grows combinatorially and debugging and testing on clusters is expensive. In this work we propose DistIR, an expressive intermediate representation for distributed DNN computation that is tailored for efficient analyses such as simulation. This enables automatically identifying the top-performing execution strategies without executing on physical hardware. Unlike prior work, DistIR can naturally express many distribution strategies, including pipeline parallelism with arbitrary schedules. Our evaluation on MLP training and GPT-2 inference models demonstrates how DistIR and its simulator enable fast grid searches over complex distribution spaces spanning more than 1,000 configurations, reducing optimization time by an order of magnitude in some regimes.
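The simulator-driven search described above can be illustrated with a toy sketch. Everything below (function names, the analytical cost model, and the candidate grid) is an illustrative assumption, not DistIR's actual API: a real simulator would estimate per-op costs from the IR rather than from a closed-form formula.

```python
# Illustrative sketch (not DistIR's API): pick the cheapest hybrid strategy
# according to a toy analytical cost model, without touching real hardware.
from itertools import product

def simulate_cost(dp, pp, microbatches, n_devices=16,
                  step_flops=1e12, device_flops=1e11,
                  grad_bytes=1e9, bandwidth=1e10):
    """Toy predicted seconds per training step for a (dp, pp) strategy."""
    if dp * pp > n_devices or microbatches < pp:
        return float("inf")                              # infeasible configuration
    compute = step_flops / (device_flops * dp * pp)      # ideal compute scaling
    bubble = (pp - 1) / microbatches * compute           # pipeline "bubble" overhead
    allreduce = grad_bytes * (dp - 1) / (bandwidth * dp) # data-parallel gradient sync
    return compute + bubble + allreduce

configs = product([1, 2, 4, 8, 16],   # data-parallel degree
                  [1, 2, 4, 8],       # pipeline-parallel degree
                  [1, 2, 4, 8, 16])   # number of microbatches
best = min(configs, key=lambda c: simulate_cost(*c))
print("best (dp, pp, microbatches):", best)
```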
Safely deploying machine learning models in the real world is often a challenging process. Models trained with data obtained from a specific geographic location tend to fail when queried with data obtained elsewhere, agents trained in a simulation can struggle to adapt when deployed in the real world or in novel environments, and neural networks fitted to a subset of the population may carry some selection bias into their decision process. In this work, we describe the problem of data shift from a novel information-theoretic perspective by (i) identifying and describing the different sources of error and (ii) comparing some of the most promising objectives explored in the recent domain generalization and fair classification literature. From our theoretical analysis and empirical evaluation, we conclude that the model selection procedure needs to be guided by careful consideration of the observed data, the factors used for correction, and the structure of the data-generating process.
Parallel implementations of stochastic gradient descent (SGD) have received significant research attention, thanks to its excellent scalability properties. A fundamental barrier when parallelizing SGD is the high bandwidth cost of communicating gradient updates between nodes; consequently, several lossy compression heuristics have been proposed, by which nodes only communicate quantized gradients. Although effective in practice, these heuristics do not always converge. In this paper, we propose Quantized SGD (QSGD), a family of compression schemes with convergence guarantees and good practical performance. QSGD allows the user to smoothly trade off communication bandwidth and convergence time: nodes can adjust the number of bits sent per iteration, at the cost of possibly higher variance. We show that this trade-off is inherent, in the sense that improving it past some threshold would violate information-theoretic lower bounds. QSGD guarantees convergence for convex and non-convex objectives, under asynchrony, and can be extended to stochastic variance-reduced techniques. When applied to training deep neural networks for image classification and automated speech recognition, QSGD leads to significant reductions in end-to-end training time. For instance, on 16 GPUs, we can train the ResNet-152 network to full accuracy on ImageNet 1.8× faster than the full-precision variant. Further, even computationally heavy architectures such as Inception and ResNet benefit from the reduction in communication: on 16 GPUs, QSGD reduces the end-to-end convergence time of ResNet-152 by approximately 2×. Networks trained with QSGD converge to virtually the same accuracy as full-precision variants, and gradient quantization may even slightly improve accuracy in some settings.

Related Work. One line of related research studies the communication complexity of convex optimization. In particular, [40] studied two-processor convex minimization in the same model, provided a lower bound of Ω(n(log n + log(1/ε))) bits on the communication cost of n-dimensional convex problems, and proposed a non-stochastic algorithm for strongly convex problems whose communication cost is within a log factor of the lower bound. By contrast, our focus is on stochastic gradient methods. Recent work [5] focused on lower bounds on the number of communication rounds necessary for convex learning. Buckwild! [10] was the first to consider the convergence guarantees of low-precision SGD. It gave upper bounds on the error probability of SGD, assuming unbiased stochastic quantization, convexity, and gradient sparsity, and showed significant speedup when solving convex problems on CPUs. QSGD refines these results by focusing on the trade-off between communication and convergence: we view quantization as an independent source of variance for SGD, which allows us to employ standard convergence results [7].
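As a concrete illustration of the quantization scheme, the following is a minimal NumPy sketch of an unbiased stochastic quantizer in the spirit of QSGD. The function name and the default number of levels are illustrative, and a real implementation would also bit-pack the quantized coordinates before communication rather than returning dequantized floats.

```python
import numpy as np

def qsgd_quantize(v, s=15, rng=None):
    """Stochastically quantize vector v onto s+1 levels of |v_i| / ||v||_2,
    rounding each coordinate up or down at random so that E[Q(v)] = v."""
    rng = np.random.default_rng() if rng is None else rng
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return np.zeros_like(v)
    levels = np.abs(v) / norm * s              # position in [0, s]
    lower = np.floor(levels)
    prob_up = levels - lower                   # probability of rounding up
    rounded = lower + (rng.random(v.shape) < prob_up)
    return np.sign(v) * norm * rounded / s     # dequantized representation

g = np.random.randn(8)
# Averaging many quantizations recovers g, illustrating unbiasedness.
print(np.mean([qsgd_quantize(g) for _ in range(20000)], axis=0) - g)
```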
Generative neural samplers are probabilistic models that implement sampling using feedforward neural networks: they take a random input vector and produce a sample from a probability distribution defined by the network weights. These models are expressive and allow efficient computation of samples and derivatives, but cannot be used for computing likelihoods or for marginalization. The generative-adversarial training method makes it possible to train such models through the use of an auxiliary discriminative neural network. We show that the generative-adversarial approach is a special case of an existing, more general variational divergence estimation approach. We show that any f-divergence can be used for training generative neural samplers. We discuss the benefits of various choices of divergence functions on training complexity and the quality of the obtained generative models.
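For reference, the variational divergence estimation bound that this approach builds on can be written as below, where f* denotes the Fenchel conjugate of f and T ranges over a class of critic functions (for example, a neural network); maximizing the right-hand side over T and minimizing it over the sampler distribution Q gives the adversarial training objective.

```latex
D_f(P \,\|\, Q) \;=\; \int q(x)\, f\!\left(\frac{p(x)}{q(x)}\right)\mathrm{d}x
\;\;\ge\;\; \sup_{T}\;\Big(\mathbb{E}_{x\sim P}\big[T(x)\big]
\;-\; \mathbb{E}_{x\sim Q}\big[f^{*}(T(x))\big]\Big)
```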
We investigate the capacity, convexity and characterization of a general family of norm-constrained feed-forward networks.
Deep neural networks (DNNs) are well known to be vulnerable to adversarial examples (AEs). In addition, AEs have adversarial transferability, which means that AEs generated for a source model can fool another black-box model (the target model) with a non-trivial probability. In this paper, we investigate the adversarial transferability between models, including ConvMixer, for the first time. To objectively verify the transferability property, the robustness of the models is evaluated with a benchmark attack method called AutoAttack. In an image classification experiment, ConvMixer is confirmed to be weak against adversarial transferability.
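The transferability measurement described here can be sketched as follows: craft adversarial examples on the source model and count how often they also fool the target model. This is a hedged illustration, not the paper's evaluation code; a plain FGSM attack is used as a simple stand-in for the AutoAttack benchmark, and all names are assumptions.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=8 / 255):
    """One-step FGSM attack (stand-in for AutoAttack in this sketch)."""
    x = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x), y).backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

def transfer_rate(source_model, target_model, loader, eps=8 / 255):
    """Fraction of source-crafted AEs that also fool the black-box target."""
    fooled, total = 0, 0
    for x, y in loader:
        x_adv = fgsm(source_model, x, y, eps)
        with torch.no_grad():
            pred = target_model(x_adv).argmax(dim=1)
        fooled += (pred != y).sum().item()
        total += y.numel()
    return fooled / total
```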
In this paper, we propose a novel method for access control with a secret key to protect models from unauthorized access. We focus on semantic segmentation models based on the vision transformer (ViT), known as the segmentation transformer (SETR). Most existing access control methods focus on image classification tasks or are limited to CNNs. By exploiting the patch embedding structure that ViT has, trained models and test images can be efficiently encrypted with a secret key, and semantic segmentation tasks are then carried out in the encrypted domain. In an experiment, the method is confirmed to provide authorized users with the correct key the same accuracy as that of models trained with plain images without any encryption, and to provide extremely degraded accuracy to unauthorized users.
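One common way to realize patch-aligned image encryption of this kind is block-wise pixel shuffling with a key, where the block size matches the ViT patch size. The sketch below is an assumed illustration of that general idea, not the paper's exact transform; a user without the key cannot produce inputs that the keyed model processes correctly.

```python
import numpy as np

def blockwise_pixel_shuffle(img, key, block=16):
    """Shuffle the pixels inside each (block x block) patch of an (H, W, C)
    image with a key-derived permutation; H and W must be divisible by block."""
    h, w, c = img.shape
    perm = np.random.default_rng(key).permutation(block * block * c)
    out = img.copy()
    for i in range(0, h, block):
        for j in range(0, w, block):
            patch = out[i:i + block, j:j + block].reshape(-1)
            out[i:i + block, j:j + block] = patch[perm].reshape(block, block, c)
    return out
```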
In this paper, we propose an encryption method for ConvMixer models with a secret key. Encryption methods for DNN models have been studied to achieve adversarial defenses, model protection, and privacy-preserving image classification. However, the use of conventional encryption methods degrades the performance of models compared with that of plain models. Accordingly, we propose a novel method for encrypting ConvMixer models. The method is carried out on the basis of the embedding architecture that ConvMixer has, and models encrypted with the method have the same performance as models trained with plain images only when test images are encrypted with the secret key. In addition, the proposed method does not require any specially prepared data for model training or any network modification. In an experiment, the effectiveness of the proposed method is evaluated in terms of classification accuracy and model protection in an image classification task on the CIFAR-10 dataset.
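ConvMixer's patch embedding is a strided convolution whose kernel covers exactly one patch, so a keyed permutation of the pixels within each patch can in principle be matched by permuting the flattened kernel weights with the same key. The sketch below is an assumed illustration of this correspondence, not the paper's exact procedure; the function name and the flattening order are assumptions.

```python
import numpy as np
import torch

def keyed_patch_embedding(conv, key):
    """Permute the flattened patch-embedding kernel of an nn.Conv2d whose
    kernel_size equals its stride. If test-image patches are flattened in the
    same order and permuted with the same key, the permuted kernel reproduces
    the plain model's embedding outputs."""
    w = conv.weight.data                          # (out_ch, in_ch, p, p)
    flat = w.reshape(w.shape[0], -1)              # one row per output channel
    perm = torch.from_numpy(
        np.random.default_rng(key).permutation(flat.shape[1]))
    conv.weight.data = flat[:, perm].reshape_as(w).contiguous()
    return conv
```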
In this paper, we propose a combined use of transformed images and a vision transformer (ViT) model transformed with a secret key. We show for the first time that models trained with plain images can be directly transformed, on the basis of the ViT architecture, into models for encrypted images, and that the performance of the transformed models is the same as that of models trained with plain images when test images are encrypted with the key. In addition, the proposed scheme does not require any specially prepared data for training models or any network modification, so it also allows us to easily update the secret key. In an experiment, the effectiveness of the proposed scheme is evaluated in terms of performance degradation and model protection performance in an image classification task on the CIFAR-10 dataset.
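One practical consequence of the easy key update claimed above can be sketched as follows. Assuming a permutation-based embedding transform as in the previous sketch (an assumption, not the paper's exact construction), re-keying only requires re-permuting the already transformed embedding weights, with no retraining and no access to the plain weights.

```python
import numpy as np
import torch

def update_key(conv, key_old, key_new):
    """Re-key a patch-embedding layer already transformed with key_old so that
    it matches images encrypted with key_new (permutation-based transform
    assumed, as in the previous sketch)."""
    w = conv.weight.data
    flat = w.reshape(w.shape[0], -1)
    n = flat.shape[1]
    perm_old = np.random.default_rng(key_old).permutation(n)
    perm_new = np.random.default_rng(key_new).permutation(n)
    # undo the old permutation and apply the new one as a single index map
    idx = torch.from_numpy(np.argsort(perm_old)[perm_new])
    conv.weight.data = flat[:, idx].reshape_as(w).contiguous()
    return conv
```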
The local shape of the loss landscape around a minimum, especially its flatness, has been intensively studied and is known to play an important role in the generalization of deep models. We develop a training algorithm called PoF: Post-training of Feature Extractor, which updates the feature extractor part of an already-trained deep model to search for a flatter minimum. Its characteristics are twofold: 1) the feature extractor is trained under parameter perturbations in the higher-layer parameter space, based on observations that suggest flattening the higher-layer parameter space, and 2) the perturbation range is determined in a data-driven manner, aiming to reduce the part of the test loss caused by positive loss curvature. We provide a theoretical analysis showing that the proposed algorithm implicitly reduces the targeted Hessian components as well as the loss. Experimental results show that PoF improves model performance over baseline methods on the CIFAR-10 and CIFAR-100 datasets with only 10 epochs of post-training, and on the SVHN dataset with 50 epochs of post-training. Source code is available at: https://github.com/densoitlab/pof-v1
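A minimal sketch of the general idea, not the authors' exact algorithm: perturb the already-trained higher-layer (head) parameters and update only the feature extractor under the perturbed head, pushing it toward a region where the loss is flat with respect to the higher layers. The isotropic Gaussian perturbation below is a simplifying assumption; the paper determines the perturbation range in a data-driven way.

```python
import torch

def pof_style_step(feature_extractor, head, x, y, optimizer, sigma=0.01):
    """One post-training step: optimizer must hold only the feature
    extractor's parameters, so the head is left unchanged overall."""
    criterion = torch.nn.CrossEntropyLoss()
    noises = []
    with torch.no_grad():                        # temporarily perturb the head
        for p in head.parameters():
            n = sigma * torch.randn_like(p)
            p.add_(n)
            noises.append(n)
    optimizer.zero_grad()
    loss = criterion(head(feature_extractor(x)), y)
    loss.backward()
    optimizer.step()                             # updates the feature extractor only
    with torch.no_grad():                        # restore the original head
        for p, n in zip(head.parameters(), noises):
            p.sub_(n)
    return loss.item()
```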